Ordination is a useful tool for visualising multivariate data. These notes focus on explaining and demonstrating principal component analysis in the R programming language. Examples will use morphological data from six species of fig wasps in a community associated with the Sonoran Desert Rock Fig (Ficus petiolaris). These notes are also available as PDF and DOCX documents.



Introduction: What is ordination?

Ordination is a method for investigating and visualising multivariate data. It allows us to look at and understand variation in a data set when there are too many dimensions to see everything at once. Imagine a data set with four different variables. I have simulated one below.

Variable_1 Variable_2 Variable_3 Variable_4
Sample_1 4.5861827 4.2909065 4.483419 0.3040486
Sample_2 -2.2126922 -2.7365302 -3.340832 -3.6906426
Sample_3 -0.8453678 -2.2764760 6.406901 0.4842919
Sample_4 -1.5758203 -1.8490932 2.497495 -2.4536086
Sample_5 1.8688777 0.5200424 7.724937 2.2060064
Sample_6 4.1790096 5.4997401 3.391069 7.8061058
Sample_7 -2.3108233 -3.3433430 6.184213 -1.5824757
Sample_8 6.4586122 5.6194472 4.745236 5.4699584
Sample_9 -2.0237546 -0.9955327 7.764849 7.0316418
Sample_10 2.2055499 2.5415082 4.839646 -1.2850098
Sample_11 -3.5523320 -3.8094051 1.342192 -1.2165192
Sample_12 -3.1892416 -2.5861976 7.118288 6.5280044
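A data set with this structure can be simulated in a few lines of R. The distributions and seed below are my own assumptions for illustration; the values printed above came from a different simulation.

```r
# Simulate 12 samples of 4 variables, with a shared signal so that
# some variables are correlated (an assumed stand-in for the table above)
set.seed(1)
shared <- rnorm(12, mean = 0, sd = 3)  # signal shared by several variables
dat <- data.frame(Variable_1 = shared + rnorm(12),  # correlated with Variable_2
                  Variable_2 = shared + rnorm(12),
                  Variable_3 = rnorm(12, mean = 4, sd = 2),
                  Variable_4 = shared + rnorm(12, sd = 3),
                  row.names  = paste0("Sample_", 1:12))
print(dat)
```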

We could look at the distribution of each variable in columns 1-4 individually using a histogram. Or we could use a scatterplot to look at two variables at the same time, with one variable plotted in one dimension (horizontal x-axis) and a second variable plotted orthogonally (i.e., at a right angle) in another dimension (y-axis). I have done this below with a histogram for Variable_1, and with a scatterplot of Variable_1 versus Variable_2. The numbers in the scatterplot points correspond to the sample number (i.e., rows 1-12).
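In base R, plots of the kind just described can be made with `hist` and `plot`, labelling points with their sample numbers via `text` (here `dat` is a simulated stand-in for the table above):

```r
# Histogram of one variable, then a scatterplot of two variables with
# sample numbers plotted in place of points
set.seed(1)
shared <- rnorm(12, sd = 3)
dat <- data.frame(Variable_1 = shared + rnorm(12),
                  Variable_2 = shared + rnorm(12))
hist(dat$Variable_1, xlab = "Variable_1", main = "")
plot(dat$Variable_1, dat$Variable_2, type = "n",  # empty plot region
     xlab = "Variable_1", ylab = "Variable_2")
text(dat$Variable_1, dat$Variable_2, labels = 1:12)  # sample numbers
```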

Since we do not have four dimensions of space for plotting, we cannot put all four variables on a scatterplot that we can visualise with an axis for each variable. Hence, in the scatterplot to the right above, we are ignoring Variable_3 and Variable_4.

For variables 1 and 2, we can see that Sample_1 is similar to Sample_6 because the points are close together, and that Sample_1 is different from Sample_11 because the points are relatively far apart. But it might be useful to see the distance between Sample_1, Sample_6, and Sample_11 while simultaneously taking into account Variables 3 and 4. More generally, it would be useful to see how distant different points are from one another given all of the dimensions in the data set. Ordination methods are tools that try to do this, allowing us to visualise high-dimensional variation in data in a reduced number of dimensions (usually two).
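The pairwise distances that ordination tries to preserve can be computed directly with `dist`, which uses all of the dimensions at once (again using simulated stand-in data):

```r
# Euclidean distances among all 12 samples, using all four variables
set.seed(1)
shared <- rnorm(12, sd = 3)
dat <- data.frame(Variable_1 = shared + rnorm(12),
                  Variable_2 = shared + rnorm(12),
                  Variable_3 = rnorm(12, mean = 4, sd = 2),
                  Variable_4 = shared + rnorm(12, sd = 3),
                  row.names  = paste0("Sample_", 1:12))
round(dist(dat), 2)  # lower triangle of the 12 x 12 distance matrix
```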

It is important to emphasise that ordination is a tool for exploring data and reducing dimensionality; it is not a method of hypothesis testing (i.e., not a type of linear model). Variables in ordinations are interpreted as response variables (i.e., \(y\) variables in models where \(y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\)). Ordinations visualise the total variation in these variables, but do not test relationships between independent (\(x\)) and dependent (\(y\)) variables.

These notes will focus entirely on Principal Component Analysis (PCA), which is just one of many ordination methods (perhaps the most commonly used). After reading these notes, you should have a clear conceptual understanding of how a PCA is constructed, and how to read one when you see one in the literature. You should also be able to build your own PCA in R using the code below. For most biologists and environmental scientists, this should be enough knowledge to allow you to use PCA effectively in your own research. I have therefore kept the maths to a minimum when explaining the key ideas, putting all of the matrix algebra underlying PCA into a single section toward the end.

Principal Component Analysis: key concepts

Imagine a scatterplot of data, with each variable getting its own axis representing some kind of measurement. If there are only two variables, as with the scatterplot above, then we would only need an x-axis and a y-axis to show the exact position of each data point in data space. If we add a third variable, then we would need a z-axis, which would be orthogonal to the x-axis and y-axis (i.e., three dimensions of data space). If we need to add a fourth variable, then we would need yet another axis along which our data points could vary, making the whole data space extremely difficult to visualise. As we add yet more variables and more axes, the position that our data points occupy in data space becomes impossible to visualise.

Principal Component Analysis (PCA) can be interpreted as a rotation of data in this multi-dimensional space. The distance between data points does not change at all; the data are just moved around so that the total variation in the data set is easier to see. If this verbal explanation is confusing, that is okay; a visual example should make the idea easier to understand. To make everything easy to see, I will start with only two dimensions using Variable_1 and Variable_2 from earlier (for now, just ignore the existence of Variables 3 and 4). Notice from the scatterplot that there is variation in both dimensions of the data; the variance of Variable_1 is 11.62, and the variance of Variable_2 is 12.4. But these variables are also clearly correlated. A sample for which Variable_1 is measured to be high is also very likely to have a high value of Variable_2 (perhaps these are measurements of animal length and width).

What if we wanted to show as much of the total variation as possible just on the x-axis? In other words, rotate the data so that the maximum amount of variance in the full data set (i.e., in the scatterplot) falls along the x-axis, with any variation remaining being left to fall along the y-axis? To do this, we need to draw a line that cuts through our two dimensions of space in the direction where the data are most stretched out (i.e., have the highest variance); this direction is our first Principal Component, PC1. I have drawn it below in red (left panel).
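In R, the direction of this line can be read from the `rotation` matrix returned by `prcomp`, which gives the weighting of each original variable on each principal component (simulated stand-in data again):

```r
# The direction of PC1 as a unit vector in the original variable space
set.seed(1)
shared <- rnorm(12, sd = 3)
dat <- data.frame(Variable_1 = shared + rnorm(12),
                  Variable_2 = shared + rnorm(12))
pca <- prcomp(dat)
pca$rotation[, "PC1"]  # direction along which variance is maximised
```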

To build our PCA, all that we need to do is take this red line and drag it to the x-axis so that it overlaps with \(y = 0\) (right panel). As we move Principal Component 1 (PC1) to the x-axis, we bring all of the data points with it, preserving their distances from PC1 and each other. Notice in the panel to the right above that the data have the same shape as they do in the panel to the left. The distances between points have not changed at all; everything has just been moved. Sample_1 and Sample_6 are just as close to each other as they were in the original scatterplot, and Sample_1 and Sample_11 are just as far away.
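We can confirm in R that this rotation leaves distances between points untouched: the distances among the original samples equal the distances among their principal component scores (`pca$x`), up to numerical rounding (simulated stand-in data):

```r
# Distances among samples before and after the PCA rotation
set.seed(1)
shared <- rnorm(12, sd = 3)
dat <- data.frame(Variable_1 = shared + rnorm(12),
                  Variable_2 = shared + rnorm(12))
pca <- prcomp(dat)
max(abs(dist(dat) - dist(pca$x)))  # effectively zero
```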

Principal Component 1 shows the maximum amount of variation that it is possible to show in one dimension while preserving these distances between points. What little variation remains is in PC2. Since we have maximised the amount of variation on PC1, there can be no additional covariation left between the x-axis and the y-axis. If any existed, then we would need to move the line again because it would mean that more variation could still be placed along the x-axis (i.e., more variation could be squeezed out of the covariation between variables). Hence, PC1 and PC2 are, by definition, uncorrelated. Note that this does not mean that Variable 1 and Variable 2 are not correlated (they obviously are!), but PC1 and PC2 are not. This is possible because PC1 and PC2 represent a mixture of Variables 1 and 2. They no longer represent a specific variable, but a linear combination of multiple variables.
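This is easy to check in R: the original variables are strongly correlated, but the scores on PC1 and PC2 are not (simulated stand-in data):

```r
# Correlation before and after the PCA rotation
set.seed(1)
shared <- rnorm(12, sd = 3)
dat <- data.frame(Variable_1 = shared + rnorm(12),
                  Variable_2 = shared + rnorm(12))
pca <- prcomp(dat)
cor(dat$Variable_1, dat$Variable_2)  # strongly correlated
cor(pca$x[, 1], pca$x[, 2])          # effectively zero
```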

This toy example with two variables can be extended to any number of variables and principal components. PCA finds the axis along which variation is maximised in any number of dimensions and makes that line PC1. In our example with two variables, this completed the PCA because we were only left with one other dimension, so all of the remaining variation had to go in PC2. But when there are more variables and dimensions, the data are rotated again around PC1 so that the maximum amount of variation is shown in PC2 after PC1 is fixed. For three variables, imagine a cloud of data points; first put a line through the most elongated part of the cloud and reorient the whole thing so that this most elongated part is the width of the cloud (x-axis), then spin it again along this axis so that the next widest part is the height of the cloud (y-axis). The same idea applies for data in even higher dimensional space; the data are rotated orthogonally around each additional Principal Component so that the amount of variation explained with each gets progressively smaller (in practice, however, all of this rotating is done simultaneously by the matrix algebra).
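With all four variables, `summary(prcomp(...))` shows this progressive drop in the variation explained from PC1 through PC4 (simulated stand-in data):

```r
# PCA of four variables: standard deviations (and hence proportions of
# variance) decrease from PC1 to PC4
set.seed(1)
shared <- rnorm(12, sd = 3)
dat <- data.frame(Variable_1 = shared + rnorm(12),
                  Variable_2 = shared + rnorm(12),
                  Variable_3 = rnorm(12, mean = 4, sd = 2),
                  Variable_4 = shared + rnorm(12, sd = 3))
summary(prcomp(dat))
```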

There are a few additional points to make before moving on to some real data. First, PCA is not useful if the data are entirely uncorrelated. To see why, take a look at the plot of two hypothetical (albeit ridiculous) variables in the lower left panel. The correlation between the two variables is zero, and there is no way to rotate the data to increase the amount of variation shown on the x-axis. The PCA on the lower right is basically the same figure, just moved so that the centre is on the origin.

Second, if two variables are completely correlated, the maths underlying PCA does not work because a singular matrix is created (this is the matrix algebra equivalent of dividing by zero). Here is what it looks like visually if two variables are completely correlated.

The panel on the left shows two perfectly correlated variables. The panel on the right shows what the PCA would look like. Note that there is no variation on PC2. One hundred percent of the variation can be described using PC1, meaning that if we know the value of Variable_1, then we can be certain about the value of Variable_2.
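A quick sketch of this situation in R (note that `prcomp` uses a singular value decomposition, so it still runs here; the degenerate geometry shows up as a second standard deviation of essentially zero):

```r
# Two perfectly correlated variables: all variation falls on PC1
set.seed(1)
x <- rnorm(20)
y <- 2 * x                  # perfectly correlated with x
pca <- prcomp(cbind(x, y))
pca$sdev                    # second value is numerically zero
```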

I will now move on to introduce a morphological data set collected in 2010 from a fig wasp community surrounding the Sonoran Desert rock fig (Ficus petiolaris). These data were originally published in Duthie, Abbott, and Nason (2015), and they are publicly available on GitHub.

Fig wasp morphological data

The association between fig trees and their pollinating and seed-eating wasps is a classic example of mutualism. The life-histories of figs and pollinating fig wasps are well-known and fascinating (Janzen 1979; Weiblen 2002), but less widely known are the species-rich communities of non-pollinating exploiter species of fig wasps (Borges 2015). These non-pollinating fig wasps make use of resources within figs in a variety of ways; some lay eggs into developing fig ovules without pollinating, while others are parasites of other wasps. All of these wasps develop alongside the pollinators and seeds within the enclosed inflorescence of figs, typically emerging as adults after several weeks. Part of my doctoral work focused on the community of fig wasps that lay eggs in the figs of F. petiolaris in Baja, Mexico.

The community of non-pollinators associated with F. petiolaris includes five species of the genera Idarnes (3) and Heterandrium (2). Unlike pollinators, which climb through a small hole into the enclosed inflorescence and pollinate and lay eggs from inside, these non-pollinators drill their ovipositors directly into the wall of the fig.

Principal Component Analysis in R

Principal Component Analysis: matrix algebra

Conclusions

Literature cited

Borges, Renee M. 2015. “How to be a fig wasp parasite on the fig-fig wasp mutualism.” Current Opinion in Insect Science 8. Elsevier Inc: 1–7. https://doi.org/10.1016/j.cois.2015.01.011.

Duthie, A Bradley, Karen C Abbott, and John D Nason. 2015. “Trade-offs and coexistence in fluctuating environments: evidence for a key dispersal-fecundity trade-off in five nonpollinating fig wasps.” American Naturalist 186 (1): 151–58. https://doi.org/10.1086/681621.

Janzen, Daniel H. 1979. “How to be a fig.” Annual Review of Ecology and Systematics 10 (1): 13–51. https://doi.org/10.1146/annurev.es.10.110179.000305.

Weiblen, George D. 2002. “How to be a fig wasp.” Annual Review of Entomology 47: 299–330.